Manuscript Title: Faster Protein Classification Using Suffix Trees
Running Head: Protein Classification Using Suffix Trees
Authors:
Abstract
Motivation: Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched.
Methods: Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from consideration at once by pruning entire subtrees. Although suffix trees are usually expensive in space, the fact that scoring matrix evaluation requires an in-order traversal allows nodes to be stored compactly in memory and on disk without significant loss of speed.
Results: Our implementation requires as little as 12 bytes of disk storage per input symbol. Searches are accelerated by a factor of up to eleven under typical conditions.
Availability: The package source code is available at http://sequence.rutgers.edu/sat.
Contact: [email protected]
Abbreviations: EOS, end of sequence symbol; K, kilo; M, mega; MB, megabyte; PSSM, position-specific scoring matrix.
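The pruning idea described under Methods can be illustrated with a small sketch. This is not the authors' implementation: it substitutes an uncompressed, depth-limited trie held in memory for the compact on-disk suffix tree, and the names `Node`, `build_depth_limited_trie`, and `search_pssm` are hypothetical. It assumes the matrix is a list of per-position dictionaries mapping a residue to its score, and that a segment matches when its total score reaches a threshold. A subtree is abandoned as soon as the partial score plus the best score still attainable from the remaining positions falls below the threshold, so every segment sharing that prefix is excluded at once.

```python
class Node:
    """Trie node: children keyed by residue, plus the segment starts ending here."""
    __slots__ = ("children", "hits")

    def __init__(self):
        self.children = {}
        self.hits = []  # (sequence index, start offset) pairs, filled at full depth only


def build_depth_limited_trie(sequences, depth):
    """Insert every length-`depth` window of every sequence (a simplified stand-in
    for a suffix tree; shorter trailing suffixes are ignored)."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - depth + 1):
            node = root
            for symbol in seq[start:start + depth]:
                node = node.children.setdefault(symbol, Node())
            node.hits.append((seq_id, start))
    return root


def search_pssm(root, pssm, threshold):
    """Return sorted (sequence index, start, score) for segments scoring >= threshold."""
    m = len(pssm)
    # best_rest[i] = highest score obtainable from matrix positions i..m-1.
    best_rest = [0.0] * (m + 1)
    for i in range(m - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(pssm[i].values())

    matches = []
    stack = [(root, 0, 0.0)]  # (node, depth, partial score)
    while stack:
        node, d, score = stack.pop()
        if d == m:
            matches.extend((sid, start, score) for sid, start in node.hits)
            continue
        for symbol, child in node.children.items():
            # Residues absent from the matrix column can never match.
            s = score + pssm[d].get(symbol, float("-inf"))
            # Early stop: skip the whole subtree if the threshold is unreachable.
            if s + best_rest[d + 1] >= threshold:
                stack.append((child, d + 1, s))
    return sorted(matches)
```

Each distinct prefix is scored once no matter how often it occurs in the input, which is where a tree-based search can save work over scanning every window separately. The space savings reported in the abstract depend on the compact node layout, which this sketch does not attempt to reproduce.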
Similar articles
Protein Family Classification Using Sparse Markov Transducers
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditio...
Accelerating Protein Classification Using Suffix Trees
Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from con...
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without...
Sequence Motif Identification and Protein Family Classification Using Probabilistic Trees
Efficient family classification of newly discovered protein sequences is a central problem in bioinformatics. We present a new algorithm, using Probabilistic Suffix Trees, which identifies equivalences between the amino acids in different positions of a motif for each family. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains.